feat(q4_0): FP32→Q4_0 quantizer (loader-agnostic production) #651
Merged
Adds Q4_0Quantizer in commonMain: the produce side Q4_0 was missing (it was decode-only, since GGUF arrives pre-quantized). Now any source of dense FP32 weights (a SafeTensors/JSON loader, an in-memory tensor, an offline tool) can emit canonical ggml Q4_0 blocks without GGUF.

Algorithm matches ggml quantize_row_q4_0: per 32-element block, scale d = max/-8 (max = signed max-magnitude element), code = clamp(round(x/d + 8), 0, 15), packed in the canonical split layout; scale stored as round-to-nearest FP16.

Tests:
- Q4_0QuantizerTest: round-trips through Q4_0TensorData.toFloatArray within 4-bit error, recovers the max element, zero stays zero.
- Q4_0QuantizeRoundTripMatmulTest: quantized weights run through the matmul dispatch and track the dense FP32 result, proving the quantizer output is consumable by the (scalar/Panama/native) kernels.

Note: automatic on-load quantization via a loader policy is deliberately NOT wired here. DTypePolicy targets logical DType, not TensorEncoding, so requesting "Q4_0" needs a new encoding-policy type, an RFC-level API decision (parallel to #615) the maintainer should own. This PR ships the reusable primitive every such path would call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
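The per-block rule in the commit message can be sketched in plain Kotlin as below. This is an illustration under stated assumptions, not the code in this PR: `Q4Block` and `quantizeBlock` are hypothetical names, the scale is kept as a `Float` rather than rounded to FP16, and "canonical split layout" is taken to mean that byte j holds element j in its low nibble and element j + 16 in its high nibble.

```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt

const val BLOCK_SIZE = 32

/** Hypothetical holder: one scale plus 16 bytes carrying 32 packed 4-bit codes. */
class Q4Block(val scale: Float, val packed: ByteArray)

/** Quantizes the 32 floats starting at [offset] following the rule quoted above. */
fun quantizeBlock(x: FloatArray, offset: Int = 0): Q4Block {
    require(offset >= 0 && offset + BLOCK_SIZE <= x.size) {
        "need $BLOCK_SIZE elements starting at offset $offset"
    }

    // Signed value of the element with the largest magnitude.
    var max = 0f
    var absMax = 0f
    for (i in 0 until BLOCK_SIZE) {
        val v = x[offset + i]
        if (abs(v) > absMax) {
            absMax = abs(v)
            max = v
        }
    }

    // d = max / -8, so the max-magnitude element maps to code 0.
    // Assumption: the real quantizer rounds d to FP16 before storing it.
    val d = max / -8f

    // An all-zero block has d == 0; every element then gets the midpoint code 8,
    // which decodes back to exactly zero.
    fun code(v: Float): Int =
        if (d == 0f) 8 else (v / d + 8f).roundToInt().coerceIn(0, 15)

    // Split packing: low nibble from the first half, high nibble from the second half.
    val half = BLOCK_SIZE / 2
    val packed = ByteArray(half)
    for (j in 0 until half) {
        val lo = code(x[offset + j])
        val hi = code(x[offset + j + half])
        packed[j] = (lo or (hi shl 4)).toByte()
    }
    return Q4Block(d, packed)
}
```

Decoding a code q as (q - 8) * d then returns the max-magnitude element exactly, which is the property the "recovers the max element" test relies on. The sketch does not handle NaN inputs.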
📖 Documentation Preview: the documentation has been built successfully for this PR.
Phase B: the produce side of Q4_0. Q4_0 was decode-only (GGUF arrives pre-quantized); this adds Q4_0Quantizer in commonMain so any source of dense FP32 weights (a SafeTensors/JSON loader, an in-memory tensor, an offline tool) can emit canonical ggml Q4_0 blocks without going through GGUF. This is the loader-agnostic primitive that "Q4_0 from any loader" actually requires.

What
- Q4_0Quantizer.quantizeToBytes(FloatArray) / .quantize(FloatArray, Shape): Q4_0BlockTensorData.
- Algorithm matches ggml quantize_row_q4_0: per 32-element block, d = max/-8, code = clamp(round(x/d + 8), 0, 15), canonical split packing, FP16 round-to-nearest scale.

Tests
- Q4_0QuantizerTest: round-trip within 4-bit error, max-element recovery, zero-block, validation.
- Q4_0QuantizeRoundTripMatmulTest: quantized weights run through ctx.ops.matmul and track the dense FP32 result, proving the quantizer output is consumable by the scalar/Panama/native kernels (Phase B ↔ Phase A).

Deliberately deferred
Automatic on-load quantization via a loader policy is not wired here. DTypePolicy targets logical DType, not TensorEncoding, so requesting "Q4_0" needs a new encoding-policy type, an RFC-level API decision (parallel to #615) the maintainer should own. This PR ships the reusable primitive every such path would call; the policy hook is a clean follow-up.

Targeting 0.27.0. Stack #647→#650 already merged to develop; this branches off develop. Next: PR5 (docs).
🤖 Generated with Claude Code
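
For readers who want to see what the round-trip tests compare against, here is a minimal sketch of the decode side. It assumes the usual ggml Q4_0 block of 18 bytes (a little-endian FP16 scale followed by 16 bytes of packed codes) and the same split layout as on the quantize side. `halfToFloat` and `decodeQ4_0Block` are hypothetical helper names, not functions from this repository.

```kotlin
/** Converts the low 16 bits of [h], read as an IEEE 754 half, to a Float. */
fun halfToFloat(h: Int): Float {
    val negative = (h and 0x8000) != 0
    val exp = (h ushr 10) and 0x1F
    val man = h and 0x3FF
    val magnitude = when (exp) {
        0 -> man / 16777216f                    // zero or subnormal: man * 2^-24
        31 -> if (man == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> Float.fromBits(((exp + 112) shl 23) or (man shl 13)) // rebias 15 -> 127
    }
    return if (negative) -magnitude else magnitude
}

/** Decodes one assumed 18-byte Q4_0 block starting at [offset] into 32 floats. */
fun decodeQ4_0Block(bytes: ByteArray, offset: Int = 0): FloatArray {
    require(offset >= 0 && offset + 18 <= bytes.size) { "need 18 bytes from offset $offset" }

    // Assumption: the FP16 scale is stored little-endian in the first two bytes.
    val lo = bytes[offset].toInt() and 0xFF
    val hi = bytes[offset + 1].toInt() and 0xFF
    val d = halfToFloat(lo or (hi shl 8))

    val out = FloatArray(32)
    for (j in 0 until 16) {
        val q = bytes[offset + 2 + j].toInt() and 0xFF
        out[j] = ((q and 0x0F) - 8) * d       // low nibble: element j
        out[j + 16] = ((q ushr 4) - 8) * d    // high nibble: element j + 16
    }
    return out
}
```

Because each code is a rounded value of x/d + 8, the reconstruction error for elements that are not clamped is at most |d|/2, plus whatever is lost by rounding the scale to FP16.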